Abstract

In this paper we examine an alternative interface for phonetic
search, namely query-by-example, that avoids OOV issues associated
with both standard word-based and phonetic search methods. We
develop three methods that compare query lattices derived from
example audio against a standard ngram-based phonetic index and we
analyze factors affecting the performance of these systems. We show
that the best systems under this paradigm are able to achieve 77%
precision when retrieving utterances from conversational telephone
speech and returning 10 results from a single query (performance
that is better than a similar dictionary-based approach), suggesting
significant utility for applications requiring high precision. We
also show that these systems can be further improved using relevance
feedback: by incorporating four additional queries the precision of
the best system can be improved by 13.7% relative. Our systems
perform well despite high phone recognition error rates (> 40%) and
make use of no pronunciation or letter-to-sound resources.

Index Terms: spoken term detection, string-edit distance, keyword
spotting, spoken document retrieval

1.  Introduction 

As the volume of recorded speech data has increased through sources
such as podcasts and internet audio, the need for technologies that
allow fast access to this data based on content has also grown.
Research in content-based retrieval methods from spoken data has a
long tradition in both the information retrieval and speech
recognition communities. A number of content-based retrieval methods
have been explored, including topic detection/tracking, spoken term
detection, and spoken document retrieval. This research has been
supported by multiple evaluations, including the recent 2006 NIST
Spoken Term Detection (STD) evaluation.

In this paper we focus on the retrieval of spoken utterances
containing search terms, specifically in the context of low-resource
spoken term detection. Much of the work done to date has focused on
spoken term detection in languages and domains where transcribed
speech and phonetic lexicon resources are widely available. As such,
the best current methods make heavy use of word-based speech
recognition during the indexing process. These systems assume that
1) well-trained recognizers are available for the STD
domain/language of interest and that 2) the search vocabulary is
likely to be well covered by the language models used during
indexing (i.e. low OOV for query terms).

Many state-of-the-art systems also make use of phonetic search,
especially when language resources are scarce. As phonetic
recognition is typically more error prone during indexing,
soft-matching procedures are needed to account for these errors
during search. These systems typically assume that, at query time,
an orthographic representation of the search term can be converted
into a phonetic representation for comparison against an index using
either pronunciation lexica or grapheme-to-phoneme conversion
algorithms. Such systems can be subject to OOV issues, as
grapheme-to-phoneme conversion is not always available and query
terms may not be found in pronunciation lexica.

In this work we examine an alternative interface for phonetic
search, namely query-by-example, that avoids OOV issues associated
with both standard word-based and phonetic search methods. We assume
that a user provides query examples either from excised speech cuts
corresponding to a search term or via a speech recording interface.
The resulting audio is then used to create a query lattice for
comparison against a standard phonetic index. We compare three
methods for the retrieval of utterances from ngram-based phonetic
indices given a query lattice and we analyze factors affecting the
performance of these systems. We show that the best systems under
this paradigm are able to achieve 77% precision on conversational
telephone speech when returning 10 results (performance that is
better than a similar dictionary-based approach), suggesting
significant utility for applications requiring high precision. We
also show that the system can be further improved using relevance
feedback: by incorporating four additional queries the precision of
the best system can be improved by 13.7% relative. Our systems
perform well despite high phone recognition error rates (> 40%) and
make use of no pronunciation or letter-to-sound resources.

We treat the problem as a 'string-distance comparison' to leverage
and improve existing stochastic string-distance methods [1, 2, 3].
We develop two types of algorithms to calculate confusion network to
confusion network (lattice to lattice) distance, based on 1-1
alignments and on string-edit distance. The strength of the former
is its ease of implementation and speed, while the latter can cope
with misalignments by explicitly modeling insertions and deletions.

1.1.  Related  Work 

Significant work has been done on the use of phonetic units for
content-based retrieval from speech (e.g. [4, 5, 6]). Though our
work is closely related to these efforts, we do not use
pronunciations derived from pronunciation lexica or letter-to-sound
systems. Furthermore, we make use of a lattice of possible
alternatives from both the query term and search indices. This
aspect of our matching procedure is related to Spoken Document
Retrieval (SDR) work done by [7] and [8], suggesting that lattice
representations of search indices and queries can improve SDR
performance.

Less  work  has  been  done  involving  methods  for  speech 
search  by  example.  In  [9],  the  authors  describe  a  method  for 
example-based  query  generation  for  general  search.  In  related 
music  retrieval  work,  [10]  attempts  to  retrieve  drum  loops  from 
onomatopoeic  queries. 

2.  Algorithms  for  Query-to-Index  Matching 

In  this  section  we  describe  three  different  approaches  to  the 
query-by-example  search  problem.  All  systems  assume  that  a 
compact  phone  ngram  index  is  used.  The  form  of  this  index  is 
essentially  a  pruned  confusion  network  as  shown  in  Figure  1.  In 
our  particular  implementation,  we  keep  epsilon  arc  information 
for  use  during  term  matching.  This  representation  allows  very 
compact  indexing  of  large  data  sets. 

Figure 1: Example 1-gram phonetic index for the term allow

All three models presented below can be viewed as stochastic
measures of string-to-string distance/similarity. In [1] and [3]
cost parameters are learned from supervised data for either HMM or
CRF models, while in our case these costs are fixed, as our systems,
by design, make no use of supervised data other than the query. As
in [2], we compare the performance of different alignment procedures
for scoring a query lattice against a large indexed database of
speech. Our alignment models vary from 1) fixed 1-1 alignment, to
2) Viterbi string-edit distance, to 3) full Forward probability. As
in [2] and [4], our systems make use of term weighting (either IDF
or a background model) to improve scoring.

2.1.  Direct  Index  Matching 

Our first system assumes queries and indices do not require
alignment during the matching procedure. At query time, we convert
the query lattice into an ngram index (order 2) as described above.
The resulting query structure is then compared against indexed
speech of the same form, column by column, in sliding windows over
all utterances. Scores are computed for each sliding window. This
system is inspired by [4], though it differs from that approach in
that we enforce a 1-1 query-to-index alignment scheme in place of
the bag-of-ngrams model. Also, empirically, we found it better to
use a likelihood ratio in place of IDF weighting. Note that this
procedure does not allow for insertions or deletions. Equation 1
shows the scoring function used for each column.

    p(I_j | Q_i) = \sum_{p \in I_j} E(p | I_j) \, p(p | Q_i)    (1)

where Q_i and I_j represent the set of ngram posteriors associated
with columns i and j of the query and index respectively, and p is a
phone ngram in both Q_i and I_j. E(p | I_j) represents the expected
value of a phone ngram p in index I_j.
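
As an illustration, the column score of Equation 1 can be sketched
in Python. This is a minimal sketch, not the implementation
evaluated in this paper: it assumes each column is stored as a
dictionary mapping a phone ngram to its posterior, and the function
name column_score is hypothetical.

```python
def column_score(index_col, query_col):
    """Equation 1: sum over ngrams p in the index column of
    E(p | I_j) * p(p | Q_i).

    Both columns are assumed to be dicts {ngram: posterior}; ngrams
    missing from the query column contribute zero.
    """
    return sum(post * query_col.get(ngram, 0.0)
               for ngram, post in index_col.items())


# Toy columns (illustrative values only).
query_col = {"AX": 0.7, "AH": 0.3}
index_col = {"AX": 0.6, "EH": 0.4}
score = column_score(index_col, query_col)  # only "AX" is shared
```

Only ngrams present in both columns contribute, so two columns with
no ngram in common score zero.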


For  each  window  with  index  offset  w,  we  compute  the  joint 
sequence  probability: 

    p(I | Q, w) = \prod_{i=1}^{|Q|} p(I_{i+w} | Q_i)    (2)

For  detection  purposes,  we  compute  the  likelihood  ratio  of  joint 
sequence  probability  against  a  background  model: 


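    LR(Q | w, I) = \frac{p(I | Q, w)}{p(I | \lambda_{bkg}, w)}    (3)

    p(I | w, \lambda_{bkg}) = \prod_{i=1}^{|Q|} p(I_{i+w} | \lambda_{bkg})    (4)

                            = \prod_{i=1}^{|Q|} \sum_{p \in I_{i+w}} E(p | I_{i+w}) \, p(p | \lambda_{bkg})    (5)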

We estimate the background model parameters p(p | \lambda_{bkg})
from recognizer transcripts on a 100+ hour unlabeled set of Fisher
data.
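
The sliding-window scoring of Equations 2 to 5 can be sketched as
follows. This is a hedged illustration rather than the system used
in the experiments: it works in the log domain, stores columns as
{ngram: posterior} dictionaries, and the names window_llr,
best_window and the probability floor are assumptions introduced
here to avoid taking the logarithm of zero.

```python
import math


def column_score(index_col, model):
    """Sum over ngrams in the index column of E(p | I) * p(p | model)."""
    return sum(post * model.get(ngram, 0.0)
               for ngram, post in index_col.items())


def window_llr(query, index, bkg, w, floor=1e-10):
    """Log of Equation 3 for the window starting at index offset w."""
    llr = 0.0
    for i, query_col in enumerate(query):
        index_col = index[i + w]
        num = max(column_score(index_col, query_col), floor)  # Eq. 1 and 2
        den = max(column_score(index_col, bkg), floor)        # Eq. 4 and 5
        llr += math.log(num) - math.log(den)
    return llr


def best_window(query, index, bkg):
    """Return (offset, log likelihood ratio) of the best-scoring window."""
    n_windows = len(index) - len(query) + 1
    if n_windows <= 0:
        return None, -math.inf
    scores = [window_llr(query, index, bkg, w) for w in range(n_windows)]
    best = max(range(n_windows), key=scores.__getitem__)
    return best, scores[best]


# Toy example: a two-column query against a three-column utterance.
query = [{"L": 1.0}, {"OW": 1.0}]
index = [{"AX": 1.0}, {"L": 0.9, "R": 0.1}, {"OW": 0.8, "AW": 0.2}]
bkg = {ph: 0.2 for ph in ("AX", "L", "R", "OW", "AW")}
offset, llr = best_window(query, index, bkg)
```

Because the alignment is fixed at one query column per index column,
each window costs time linear in the query length.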

2.2.  Edit  Distance  Alignment  System 

Consider two ngram indices Q and I, a query and an indexed utterance
respectively. A column i in such an index defines a distribution
p(p | I_i) over symbols p (phones) drawn from an alphabet A. We
would like to determine whether I contains an instance of Q. To do
this we compute a string-edit distance in which we explicitly model
insertions and deletions of index columns.

We consider the string-edit distance between strings of symbols in A
to be defined by the tuple \phi = \langle A, p(E) \rangle, where
p(E) is a probability distribution over edits, p : E \to [0, 1],
with E = E_s \cup E_d \cup E_i (E_s = A \times A the set of
substitutions, E_d = A \times \{\epsilon\} the set of deletions, and
E_i = \{\epsilon\} \times A the set of insertions).

Let z^N be an N-length sequence of edit operations z_i \in E
required to transduce any path through a query index Q into any path
through confusion network I, with z_i = \langle Q_j, I_k \rangle the
edit event of transducing column Q_j into I_k, and p(z^N | \phi) the
product of the probabilities p(z_i) of the specific edit operations.

This formulation naturally leads to two distances:
d^{v}_{\phi}(Q, I) calculates the distance between two confusion
networks as the most likely transduction between any path in Q and
any path in I, and d^{s}_{\phi}(Q, I) sums over all possible
transductions between all members of Q and I.

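The first distance d^{v}_{\phi}(Q, I), a generalization of the
Viterbi path, is the negative logarithm of the probability of the
most likely edit sequence between any path through Q and any path
through I:

    d^{v}_{\phi}(Q, I) = -\log\left( \max_{z^N} p(z^N | \phi) \right)    (6)

    p(z^N | \phi) = \prod_{i=1}^{N} p(z_i)    (7)

    p(z_i) = \sum_{\forall q \in Q_j} \sum_{\forall p \in I_k} \left[ p(p | q) \, p(p | I_k) + p(q | p) \, p(q | Q_j) \right]    (8)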

where p and q are phone ngrams within columns I_k and Q_j
respectively, and p(p | I_k)-like terms are taken from the posterior
distribution of the confusion network at column k. While the
p(q | p)-like terms could be learned, we opt for the unsupervised
standard edit distance, with p(.|.) set to 1 for a match and 0 for
substitutions, insertions, and deletions. Also, note that for
implementation the networks were converted into bigram networks,
while using the above equations for each unigram in a column.
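
A Viterbi-style alignment between two confusion networks can be
sketched with a standard dynamic program. The sketch below is an
illustration under stated assumptions, not the system evaluated
here: the column match probability uses the 1-for-match,
0-otherwise substitution table, so it reduces to the posterior
overlap of the two columns; the insertion and deletion probabilities
p_ins and p_del and the probability floor are hypothetical constants
introduced so that every alignment has finite cost; and the whole
query is aligned to the whole utterance rather than to a
sub-sequence of it.

```python
import math


def match_prob(query_col, index_col):
    """Probability that phones drawn from the two columns are identical."""
    return sum(pq * index_col.get(ph, 0.0) for ph, pq in query_col.items())


def viterbi_edit_distance(query, index, p_ins=0.05, p_del=0.05, floor=1e-6):
    """Negative log probability of the best edit sequence (cf. Eq. 6).

    query and index are lists of columns, each a dict {phone: posterior}.
    """
    c_ins, c_del = -math.log(p_ins), -math.log(p_del)
    n_q, n_i = len(query), len(index)
    d = [[math.inf] * (n_i + 1) for _ in range(n_q + 1)]
    d[0][0] = 0.0
    for j in range(n_q + 1):
        for k in range(n_i + 1):
            if j > 0:  # query column with no counterpart in the index
                d[j][k] = min(d[j][k], d[j - 1][k] + c_del)
            if k > 0:  # extra index column not explained by the query
                d[j][k] = min(d[j][k], d[j][k - 1] + c_ins)
            if j > 0 and k > 0:  # align query column j with index column k
                p = max(match_prob(query[j - 1], index[k - 1]), floor)
                d[j][k] = min(d[j][k], d[j - 1][k - 1] - math.log(p))
    return d[n_q][n_i]


query = [{"L": 1.0}, {"OW": 1.0}]
extra = [{"AX": 1.0}, {"L": 1.0}, {"OW": 1.0}]  # one extraneous column
dist_same = viterbi_edit_distance(query, query)
dist_extra = viterbi_edit_distance(query, extra)
```

In the toy example the extraneous leading column is absorbed by a
single insertion, which is cheaper than forcing a mismatched
column-to-column alignment.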


2.2.1.  Hybrid  distance 

In [2] the variants of edit distance performed best when combined
with TF-IDF. Here we combine the two by using the alignment
generated from d^{v}_{\phi}(Q, I), while weighting each of the
bigram matches with the IDF_p factor, and sum the score in addition
to a unigram match score.

2.3.  HMM  Alignment  System 

In  addition  to  an  edit  distance-based  alignment,  we  employed 
a  discrete  HMM  model  for  matching.  In  this  framework,  we 
treat  the  query  as  a  discrete  HMM  in  which  each  column  of 
the  ngram  index  is  interpreted  as  an  HMM  state  and  unigram 
probabilities  are  used  as  observation  probabilities.  Epsilon  arcs 
in  the  index  structure  are  interpreted  as  skip  arc  probabilities 
and  a  fixed  stay  probability  parameter  allows  for  insertions  to 
be  observed  in  indexed  utterances  given  the  query  as  a  model. 


Figure 2: Example HMM with fixed stay probability of 0.1

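Figure 2 shows an example of the resulting query/model corresponding
to the index shown in Figure 1. During search, the discrete HMM
model of the query is aligned to the index to compute:

    p(I | \lambda_Q) = \sum_{\forall s} \prod_{i=1}^{k} p(s_i | s_{i-1}) \, p(I_i | s_i)

                     = \sum_{\forall s} \prod_{i=1}^{k} p(s_i | s_{i-1}) \sum_{p \in I_i} E(p | I_i) \, p(p | s_i)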

The E(p | I_i) term in the above equation uses the posterior of each
phone p in each index column i. This formulation allows for full
alignment of all index alternatives. Scoring is done by computing
the full Forward probability of the index I given the query model
\lambda_Q. Note that with fixed transition, skip, and stay
probabilities, this model computes d^{s}_{\phi}(Q, I) as described
in the previous section. As we make use of per-state skip
information derived from the query index, these models are not fully
equivalent.
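
The Forward computation can be sketched as follows. This is a
minimal illustration, not the evaluated system. The topology is an
assumption made for the sketch: states are query columns in
left-to-right order, each state has a skip probability taken from
its epsilon arc, a state is either re-entered with the fixed stay
probability or left with the remaining mass, and any states after
the last occupied one must be skipped. The names emission, skip_run
and forward_probability are hypothetical.

```python
def emission(state_col, index_col):
    """Expected emission: sum_p E(p | I_i) * p(p | s) for one index column."""
    return sum(post * state_col.get(ph, 0.0) for ph, post in index_col.items())


def skip_run(skip, start, stop):
    """Probability of skipping every state in range(start, stop)."""
    prob = 1.0
    for n in range(start, stop):
        prob *= skip[n]
    return prob


def forward_probability(states, skip, index, p_stay=0.1):
    """Forward probability of the index columns given the query HMM.

    states: list of query columns {phone: probability}
    skip:   per-state skip probability (epsilon arc of each query column)
    index:  list of index columns {phone: posterior}
    """
    n_states = len(states)
    if n_states == 0 or not index:
        return 0.0
    # Enter state m at the first column after skipping states 0..m-1.
    alpha = [skip_run(skip, 0, m) * (1.0 - skip[m]) * emission(states[m], index[0])
             for m in range(n_states)]
    for col in index[1:]:
        new_alpha = []
        for m in range(n_states):
            total = alpha[m] * p_stay  # stay: the index has an extra column
            for j in range(m):         # advance from j to m, skipping j+1..m-1
                total += (alpha[j] * (1.0 - p_stay)
                          * skip_run(skip, j + 1, m) * (1.0 - skip[m]))
            new_alpha.append(total * emission(states[m], col))
        alpha = new_alpha
    # Any states after the final occupied one must be skipped.
    return sum(alpha[j] * skip_run(skip, j + 1, n_states)
               for j in range(n_states))


states = [{"L": 1.0}, {"OW": 1.0}]
exact = forward_probability(states, [0.0, 0.0], [{"L": 1.0}, {"OW": 1.0}])
repeated = forward_probability(states, [0.0, 0.0],
                               [{"L": 1.0}, {"L": 1.0}, {"OW": 1.0}])
```

In the toy example the repeated L column is explained by one use of
the stay transition, so the utterance still receives non-zero
probability, at a cost of the stay probability.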

3.  Experimental  Design 

We evaluated the ability of the methods proposed above to retrieve
utterances containing query terms on conversational telephone speech
data from the Fisher corpus, and we compared these methods to a
baseline retrieval system that makes use of a pronunciation lexicon.
From transcripts provided by LDC we extracted 248 single-word terms
with an average of 19.2 target trials per term (minimum: 10,
maximum: 500). An additional set of 967 utterances was also
extracted as non-target trials.

Across these terms we constructed a held-out set of 1,843 query
examples (minimum: 1, maximum: 10, average: 7.4 examples per term)
extracted from continuous speech utterances. Words were selected
from the STD 2006 Evaluation term list with a minimum of 5
orthographic characters. This was done in order to avoid sampling of
unlikely search terms (e.g. short function words).

For the purpose of these experiments each trial is a single
utterance (between 0.5 and 6 seconds) and we measured the ability of
each system to return utterances containing each target term. We
evaluated performance in terms of precision at N, where N is the
number of target trials for that word. This measures the operating
point where precision is equal to recall. For example, the term
actor has 22 target and 967 non-target trials. In this case, we
measure the performance of a detector for this term by sorting its
scores, and report precision for the top 22 scoring hits. We also
report precision at 10, the typical Google metric, as well as the
Equal-Error Rate (EER). In all cases, the performance per term is
averaged.
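
The three operating points can be computed from a list of detection
scores and target labels for one term. The sketch below is an
illustration only, not the scoring tool used for the reported
numbers: the function names are hypothetical, tied scores are not
treated specially, and the EER is approximated by the threshold at
which the miss and false-alarm rates are closest.

```python
def precision_at(scores, labels, n):
    """Precision among the n highest-scoring trials (labels are 0 or 1)."""
    ranked = sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)
    top = ranked[:n]
    return sum(label for _, label in top) / len(top)


def precision_at_n_targets(scores, labels):
    """P@N with N equal to the number of target trials for the term."""
    return precision_at(scores, labels, sum(labels))


def equal_error_rate(scores, labels):
    """Approximate EER: sweep the threshold down the ranked list."""
    n_tgt = sum(labels)
    n_non = len(labels) - n_tgt
    ranked = sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)
    hits = false_alarms = 0
    best_gap, eer = 1.0, 0.5  # threshold above every score: miss=1, fa=0
    for _, label in ranked:
        hits += label
        false_alarms += 1 - label
        miss = 1.0 - hits / n_tgt
        fa = false_alarms / n_non
        if abs(miss - fa) < best_gap:
            best_gap, eer = abs(miss - fa), (miss + fa) / 2.0
    return eer


# Toy term with 3 target and 3 non-target trials.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
labels = [1, 1, 0, 1, 0, 0]
p_at_n = precision_at_n_targets(scores, labels)
eer = equal_error_rate(scores, labels)
```

Per-term values computed this way would then be averaged over all
terms, as described above.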

We measure the performance of each system using single and multiple
query examples. Performance using multiple query examples was
examined in order to assess how these systems could perform with
user feedback, i.e. using a preliminary query example and its search
results, users might expand the set of examples corresponding to a
given query.

Having many spoken queries for some words invites a method to
combine them, or the scores resulting from their use. For systems
making use of multiple queries we employed either a max or avg score
criterion for combining results at the utterance level.
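
The max and avg criteria can be sketched in a few lines. This is an
illustrative sketch that assumes every query example has produced a
score for every utterance; the function name combine_scores and the
dictionary layout are hypothetical.

```python
from statistics import mean


def combine_scores(per_query_scores, how="avg"):
    """Combine utterance-level scores from several query examples.

    per_query_scores: list of dicts {utterance_id: score}, one dict
    per query example, all sharing the same utterance ids.
    """
    if how not in ("avg", "max"):
        raise ValueError("how must be 'avg' or 'max'")
    reduce_fn = mean if how == "avg" else max
    return {utt: reduce_fn([scores[utt] for scores in per_query_scores])
            for utt in per_query_scores[0]}


per_query = [{"utt1": 1.0, "utt2": -2.0},
             {"utt1": 3.0, "utt2": 0.0}]
avg_scores = combine_scores(per_query, "avg")
max_scores = combine_scores(per_query, "max")
```

The max criterion rewards an utterance that matches any one example
well, while avg favors utterances that match all examples
consistently.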

For all experiments phonetic indices were created by running a
neural network-based TRAPs phonetic recognizer trained on 10 hours
of Switchboard2, phase 4 data. This recognizer was also used to
generate models from query examples using the aforementioned
methods.

4.  Results 

Table 1 shows results for all methods at the three operating points
described above on our Fisher test set using single and multiple
query examples. The best method for combining multiple query returns
varies for different systems. We observe that the use of multiple
queries improves system precision by 13.7% @10. This suggests that
users would be able to provide relevance feedback to these systems
for improved performance.

Overall, all systems perform comparably, with the discrete HMM model
being slightly better at most operating points. All systems benefit
from the use of IDF term weighting and/or log likelihood ratio (llr)
scoring against a background model.

This table also compares the result of the best system (dhmm) using
queries constructed from examples and queries from dictionary
pronunciations. Query-by-example outperforms the pronunciation-based
model when a single query or pronunciation is used. This suggests
that the underlying phonetic recognizer is making consistent
"errors" between the index and the query. Few query terms have more
than one pronunciation (avg. 1.1 prons. per term); as a result,
there is little improvement when all pronunciations are combined.

We examined the performance of our best system as a function of the
length of the input query. Figure 3 plots each precision metric as a
function of query length (in characters). Although the variability
in performance at specific query lengths is high, performance
generally improves at all operating points as a function of the
query length. Note that for very short queries, system performance
is highly affected by indexing errors associated with missed columns
during query extraction. This is because we derive only approximate
timing information from the ASR phone lattice during confusion
network construction. Table 2 shows the 1-best posterior decoding
results for multiple query instances of the word allow. As shown in
this example, extraneous and missing columns are common due to
timing issues. From manual inspection, we found that timing issues,
rather than deletion or insertion errors from the recognizer or
pronunciation differences, dominate differences in the number of
indexed columns extracted for different queries of the same term.

System            Configuration                 P@10    P@N     EER
direct            one query                     70.66   49.95   20.57
direct            one query (idf)               73.65   51.90   19.70
direct            one query (llr)               74.51   52.29   19.53
direct            all queries (avg+llr)         79.15   58.84   14.84
direct            all queries (max+llr)         75.44   55.23   16.06
df                one query                     63.44   44.83   23.78
df                one query (idf)               69.41   48.18   22.01
df                all queries (avg+idf)         80.85   58.31   15.08
df                all queries (max+idf)         77.74   54.82   16.57
dhmm              one query                     74.83   53.04   20.80
dhmm              one query (llr)               76.99   53.94   20.48
dhmm              all queries (avg+llr)         81.17   54.92   18.04
dhmm              all queries (max+llr)         82.26   61.34   13.98
dhmm + pron lex.  one dict entry (llr)          73.01   47.66   21.11
dhmm + pron lex.  all dict entries (avg+llr)    73.99   48.16   20.92
dhmm + pron lex.  all dict entries (max+llr)    74.27   48.26   20.93

Table 1: Performance of each QbE system (aggregation and background
model indicated in parentheses, best single and multi-query results
are indicated in bold)


1-best indices for allow
AX L AW
L UH
AY L AE T
N AX L OW
L

Table 2: Examples of indexing errors for the term allow


5.  Discussion 

All of the methods we present in this paper show promising
performance on the utterance retrieval task. Our results suggest
that making use of lattice-to-lattice alignments between queries and
indices can improve STD performance. Similarly, term weighting,
either through IDF or through the use of a background model, can
provide significant performance improvements.

Performance of these systems is significantly influenced by the
length of the query. Generally, longer queries yield better results.
Our experiments also show that use of relevance feedback can further
improve performance by as much as 13% relative, even using rather
naive combination methods. It should be possible to allow the
HMM-based system to create aggregate models via query alignment.
Further work is needed to assess if this improves performance when
multiple queries are available.

In terms of computational cost, the baseline direct matcher is very
fast, as ngrams can be accessed in constant time. The other two
systems make use of sequence alignment. With our implementation of
the discrete HMM, alignment can be done in 270% xRT (faster) and
could be made faster by first applying a direct match to select
regions for alignment-based matching.

These systems perform well despite poor phone recognition and
indexing errors associated with approximate confusion network
timing. We expect that better indexing methods could be employed to
further improve these systems.


[Figure 3 plot; legend: P@10, P@N; panel title: Multi-query Performance]

Figure 3: Precision vs. length for single and multiple queries


5.1.  Acknowledgments 

We would like to thank Ken Basye and Hugh Secker-Walker for their
lattice-to-lattice distance and general purpose graph tools.

6.  References 

[1] E. Ristad and P. Yianilos, "Learning string-edit distance,"
IEEE Trans. on Pattern Analysis and Machine Intelligence, 1998.

[2] W. W. Cohen, P. Ravikumar, and S. E. Fienberg, "A comparison of
string distance metrics for name-matching tasks," in IJCAI, 2003,
pp. 73-78.

[3] A. McCallum, K. Bellare, and F. Pereira, "A conditional random
field for discriminatively-trained finite-state string edit
distance," in Conference on Uncertainty in AI (UAI), 2005.

[4] K. Ng and V. Zue, "Subword-based approaches for spoken document
retrieval," Ph.D. thesis, 2000.

[5] L. Burget et al., "Indexing and search methods for spoken
documents," in ICTSD, 2006.

[6] T. Hori, I. L. Hetherington, T. J. Hazen, and J. Glass,
"Open-vocabulary spoken utterance retrieval using confusion
networks," in ICASSP, 2007.

[7] M. Saraclar and R. Sproat, "Lattice-based search for spoken
utterance retrieval," in HLT-NAACL, 2004.

[8] T. K. Chia, K. C. Sim, H. Li, and H. T. Ng, "A lattice-based
approach to query-by-example spoken document retrieval," in SIGIR,
2008, pp. 363-370.

[9] H. Murao, N. Kawaguchi, S. Matsubara, and Y. Inagaki,
"Example-based query generation for spontaneous speech," IEICE
Trans. Inf. Syst., 2005.

[10] O. Gillet and G. Richard, "Drum loops retrieval from spoken
queries," Journal of Intelligent Information Systems, 2005.


